Add Output Area crosswalk and geographic assignment (Phase 1)#291
Conversation
Port the US-side clone-and-prune calibration methodology to the UK, starting with Output Area (OA) level geographic infrastructure:

- Build unified UK OA crosswalk from ONS, NRS, and NISRA data (235K areas: 189K E+W OAs + 46K Scotland OAs)
- Population-weighted OA assignment with country constraints
- Constituency collision avoidance for cloned records
- Tests validating crosswalk completeness and assignment correctness

This is Phase 1 of a 6-phase pipeline to enable OA-level calibration, analogous to the US Census Block approach.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hi Vahid,
Most of this is from our boy Claude, as usual. This looks like a great setup! Can't wait to see HHs getting donated to the OAs! I'll approve, but please see the issues Claude found below.
Here's the code I used to poke around:
```python
from policyengine_uk_data.calibration.oa_crosswalk import load_oa_crosswalk

xw = load_oa_crosswalk()
xw

# Population-weighted sampling demo
import numpy as np

xw["population"] = xw["population"].astype(float)  # workaround for bug 1 below
eng = xw[xw["country"] == "England"].copy()
eng["prob"] = eng["population"] / eng["population"].sum()
rng = np.random.default_rng(42)
idx = rng.choice(len(eng), size=10_000, p=eng["prob"].values)
sampled = eng.iloc[idx]
sampled.groupby("oa_code")["population"].agg(["count", "first"]).rename(
    columns={"count": "times_sampled", "first": "population"}
).sort_values("times_sampled", ascending=False).head(20)
```
leads to:
```
Out[1]:
           times_sampled  population
oa_code
E00179944              5      3354.0
E00035641              3       279.0
E00039569              3       263.0
E00066618              3       331.0
E00115325              2       319.0
E00136307              2       301.0
E00089585              2       333.0
E00167257              2       472.0
E00130843              2       406.0
E00021422              2       190.0
E00004742              2       313.0
E00044937              2       294.0
E00089725              2       240.0
E00044974              2       400.0
E00160095              2       401.0
E00016512              2       305.0
E00016490              2       380.0
E00089915              2       514.0
E00021502              2       396.0
E00105618              2       305.0
```
Interesting: "E00179944 with population 3,354 is a massive outlier (most OAs are 100–300 people)"
Bugs
1. load_oa_crosswalk loads population as string
load_oa_crosswalk() passes dtype=str for all columns (line 753 of oa_crosswalk.py), so population comes back as a string. This means any downstream arithmetic (e.g. computing probabilities) fails with TypeError: unsupported operand type(s) for /: 'str' and 'str'. Should either drop dtype=str or explicitly cast population to int on load.
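A minimal sketch of the suggested fix, assuming `load_oa_crosswalk` reads the gzipped CSV with pandas (the column names here match the crosswalk output; the exact reader code in `oa_crosswalk.py` may differ). Codes stay as strings to preserve any leading zeros, while `population` is cast explicitly:

```python
import io
import pandas as pd

# Stand-in for storage/oa_crosswalk.csv.gz contents (illustrative rows only).
csv = io.StringIO("oa_code,country,population\nE00000001,England,301\n")

# Keep dtype=str for code columns, then cast population after load.
xw = pd.read_csv(csv, dtype=str)
xw["population"] = xw["population"].astype(int)

assert xw["population"].dtype.kind == "i"  # arithmetic now works downstream
```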
2. NI households silently get no assignment
The crosswalk has 0 NI rows (NISRA 404), which is acknowledged, but assign_random_geography will silently produce None entries for NI households (country code 4). Worth either raising an error or logging a warning when a household's country has no distribution.
Code quality
3. Dead code in _assign_regions
Lines 602–606 of oa_crosswalk.py:
```python
for k, v in la_to_region.items():
    if k[:3] == la_code[:3]:
        # Same LA type prefix
        pass
```

This loop does nothing — should be removed or finished.
4. Assignment inner loop should be vectorised
In oa_assignment.py lines 236–245, the for i, pos in enumerate(positions) loop storing results can be replaced with vectorised numpy indexing:
```python
oa_codes[start + positions] = dist["oa_codes"][indices]
```

Same for all the other arrays. Will matter when `n_clones * n_records` gets large.
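For concreteness, a self-contained toy version of the substitution (array names follow the review; the data is made up):

```python
import numpy as np

oa_codes = np.empty(8, dtype=object)                # output array, all countries
dist_oa_codes = np.array(["E001", "E002", "E003"])  # this country's OAs
start = 2                                  # offset of this country's block
positions = np.array([0, 1, 2, 3])         # record offsets within the block
indices = np.array([2, 0, 0, 1])           # sampled distribution rows

# Loop version the review flags:
#   for i, pos in enumerate(positions):
#       oa_codes[start + pos] = dist_oa_codes[indices[i]]
# Vectorised equivalent: one fancy-indexed assignment.
oa_codes[start + positions] = dist_oa_codes[indices]
```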
Worth noting
5. Scotland population weighting is effectively uniform
The fallback of ~117 per OA for all 46k Scottish OAs means population-weighted sampling is actually uniform for Scotland. This undermines the premise for ~20% of UK OAs. Might be worth a louder warning or a TODO to revisit once NRS fixes the 403.
baogorek
left a comment
Approving Phase 1 — the crosswalk and assignment engine look good. Please see my comment above for a few things to address before merge.
Background
This PR implements Phase 1 of a 6-phase pipeline to enable Output Area (OA) level calibration — the UK equivalent of the US Census Block approach.
Why are we doing this?
The US pipeline (`policyengine-us-data`) uses a clone-and-prune approach that produces much finer geographic granularity than our current UK methodology. This PR takes the UK down to Output Area level (~235K OAs across the UK), the equivalent of the US Census Block, and is the first step.
What this PR does (Phase 1: OA Crosswalk & Geographic Assignment)
1. Unified UK Output Area Crosswalk
Downloads and combines geographic lookups from three national statistics agencies into a single crosswalk:
```
OA → LSOA/DataZone → MSOA/IntermediateZone → LA → Constituency → Region → Country
```
Data sources:
Output: `storage/oa_crosswalk.csv.gz` (1.4MB compressed) — 235,243 areas, 65M population, 632 constituencies, 363 LAs, 11 regions
2. Geographic Assignment Engine
Assigns population-weighted random Output Areas to cloned FRS household records, with two key constraints:
3. Tests — 19 passing, 1 skipped (NI)
Validates crosswalk completeness (OA counts, population totals, hierarchy nesting, country prefixes) and assignment correctness (country constraints, collision avoidance, population-weighted sampling, save/load roundtrip).
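To illustrate the core behaviour the assignment engine and tests are checking — population-weighted sampling restricted to the household's country — here is a rough sketch. All names, codes, and populations are made up and do not reflect the actual `oa_assignment.py` API:

```python
import numpy as np

# Toy crosswalk: codes, countries, and populations are illustrative only.
oa_codes = np.array(["E001", "E002", "S001", "S002"])
countries = np.array(["England", "England", "Scotland", "Scotland"])
population = np.array([300.0, 900.0, 100.0, 100.0])
rng = np.random.default_rng(0)

def sample_oas(country, n):
    mask = countries == country                        # country constraint
    probs = population[mask] / population[mask].sum()  # population weighting
    return rng.choice(oa_codes[mask], size=n, p=probs)

draws = sample_oas("England", 5)
assert set(draws) <= {"E001", "E002"}  # a household never crosses the border
```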
Known limitations
What comes next (Phases 2-6)
Phase 2: Clone-and-Assign
Clone each FRS household N times (start with N=10), assign each clone a different OA. Insert into `create_datasets.py` after imputations, before calibration.
US ref: PRs #457, #531
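A minimal sketch of the clone step planned above, assuming a numpy-style repeat of household records (array names are hypothetical):

```python
import numpy as np

N = 10  # clones per household, the starting value from the plan
household_ids = np.array([101, 102, 103])

cloned_ids = np.repeat(household_ids, N)       # each record duplicated N times
clone_index = np.tile(np.arange(N), len(household_ids))  # 0..N-1 per household

assert len(cloned_ids) == len(household_ids) * N
```

Each clone would then receive its own OA draw from the Phase 1 assignment engine.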
Phase 3: L0 Calibration Engine
Port L0-regularized optimization from US side. HardConcrete gates to actively drop records, producing sparse datasets. Add `l0-python` dependency.
US ref: PRs #364, #365
Phase 4: Sparse Matrix Builder
Build sparse `(n_targets × n_records*n_clones)` calibration matrix. Simulate PolicyEngine-UK per clone, wire existing `targets/sources/` into sparse matrix rows.
US ref: PRs #456, #489
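As a shape-only illustration of what Phase 4 would build — a sparse matrix with one row per calibration target and one column per record-clone — here is a toy `scipy.sparse` example (dimensions and values are made up; the real matrix would be filled from PolicyEngine-UK simulations):

```python
import numpy as np
from scipy import sparse

# Toy dimensions: 3 targets x 8 record-clones (the real matrix is far larger).
rows = np.array([0, 0, 1, 2])            # target index
cols = np.array([1, 5, 2, 7])            # record-clone index
vals = np.array([1.0, 1.0, 250.0, 1.0])  # e.g. counts or money amounts

M = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 8))
assert M.nnz == 4  # only nonzero entries are stored
```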
Phase 5: SQLite Target Database
Hierarchical target storage: UK → Country → Region → LA → Constituency → MSOA → LSOA → OA. Migrate existing CSV/Excel targets into SQLite.
US ref: PRs #398, #488
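An illustrative `sqlite3` sketch of hierarchical target storage; the actual schema is not specified in this PR, so the table and column names below are assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE targets (
        geo_level TEXT,  -- UK / Country / Region / LA / Constituency / MSOA / LSOA / OA
        geo_code  TEXT,
        metric    TEXT,
        value     REAL
    )
    """
)
con.execute(
    "INSERT INTO targets VALUES ('OA', 'E00000001', 'population', 301.0)"
)
rows = con.execute(
    "SELECT value FROM targets WHERE geo_level = 'OA'"
).fetchall()
```

A single table keyed by geography level keeps migration from the existing CSV/Excel targets straightforward.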
Phase 6: Local Area Publishing
Generate per-area H5 files from sparse weights. Modal integration for scale.
US ref: PR #465
File summary